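Course Syllabus

Course Information

Course title	大數據分析專題 Seminar on Big Data Analytics
Semester	107-1
Offered to	Graduate Institute of Political Science, College of Social Sciences
Instructor	張佑宗
Course number	PS5687
Course identifier	322 U2050
Section
Credits	2.0
Full/half year	Half year
Required/elective	Elective
Class time	Thursday, periods 8-9 (15:30~17:20)
Classroom	社科研605 (College of Social Sciences, Room 605)
Notes	Not open during preliminary course selection. Political thought, international relations, public administration, domestic politics, comparative politics. Restricted to undergraduates in their third year or above and to students of this department (including minor and double-major students). Enrollment cap: 38.
Ceiba course website	http://ceiba.ntu.edu.tw/1071PS5687_BDATA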
Course introduction video
Core competencies	Diagram relating core competencies to course planning
Course Syllabus
Course overview	Data science is a field whose goals overlap with those of many disciplines, in particular mathematics, statistics, algorithms, engineering, and optimization theory. It also has wide applications in a number of areas such as the natural sciences, social sciences, life sciences, business, and medicine. Data science has become an integral part of many research projects and has started to affect social science research. The promise of the “big data” revolution is that these data hold the answers to fundamental questions for businesses, governments, and social sciences such as political science and sociology. Most importantly, these quantitative techniques provide “better predictions” across different systems. Many of the most astonishing results come from computational fields, which have little experience with the difficulty of social scientific inquiry. As social scientists, we have extensive experience with and observations of our own research fields, and we can apply these new computational methods to our studies. The course objective is to study the theory and practice of constructing algorithms that learn from data. This is an applied graduate-level course for social scientists. Students will learn practical ways to build machine learning solutions for their own research. While some mathematical and statistical details are needed, we will give an overview of the quantitative tools required and emphasize the methods and their conceptual underpinnings rather than their theoretical properties. Specifically, the course will cover: k-nearest neighbors methods, the naive Bayes method, decision trees, random forests, boosting, k-means clustering, kernels, scaling, and ensemble learning. We will also discuss topics related to best practices, including overfitting and underfitting of data, error rates, cross-validation, and the use of bootstrapping methods to develop uncertainty estimates.
Statistical Software: R is a programming language and free software environment for statistical compu
Course objectives	By the end of this course, students should be able to: (1) Understand the fundamental concepts and applications of data science. (2) Explain the advantages and shortcomings of widely used machine learning algorithms. (3) Uncover patterns and structure embedded in data with machine learning methods. (4) Test and improve model specification and predictions. (5) Apply what they have learned to a social science research project. As a result, we hope that this course will appeal not just to mathematicians and statisticians but also to researchers in a wide variety of social science fields.
Course requirements	Prerequisites: One year of calculus, basic linear algebra, basic probability theory, applied statistics, and proficiency in Python/R/MATLAB, or permission of the instructors. Grading Policy: Quizzes 10%; Assignments 30%; Midterm 30%; Final Exam 30%. Assignments: There will be 5-6 problem sets during the semester, with 3-5 questions apiece, drawn mostly from the two textbooks. The datasets we will use come mainly, but not exclusively, from the social sciences and business. You are encouraged to discuss the problems with your classmates, but you must write and turn in your own answers. To be blunt, rote copying of an answer from your classmates or other sources is a waste of your time and the grader's time. Class Policy: 1. An important component of this course is active engagement with the material in class. Regular attendance is essential and expected. 2. Quizzes are closed book, closed notes. 3. No makeup quizzes will be given. 4. No food in class. Academic Honesty: Lack of knowledge of the academic honesty policy is not a reasonable explanation for a violation.
Expected weekly study hours after class
Office Hours
Required readings	To be announced
References	I. Required readings (list the required readings for each week) Main References: There are two required books for the course: 1. An Introduction to Statistical Learning: with Applications in R. Springer, 2013. Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2. Applied Predictive Modeling. Springer, 2013. Max Kuhn and Kjell Johnson. II. Further readings (list the further readings for each week) Here are some recommended readings. Students are not required to read all of these books prior to class. 1. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2. All of Statistics: A Concise Course in Statistical Inference. Springer, 2004. Larry Wasserman. 3. Python Machine Learning. Packt Publishing, 2015. Sebastian Raschka.
Grading (for reference only)

Course Schedule

Week	Date	Topic
Week 1	9/13	Overview of Data Science
Week 2	9/20	Review Session 1
Week 3	9/27	Review Session 2
Week 4	10/04	Linear Regression 1
Week 5	10/11	Linear Regression 2
Week 6	10/18	Classification 1
Week 7	10/25	Classification 2
Week 8	11/01	Resampling Methods
Week 9	11/08	Midterm Exam
Week 10	11/15	No class due to school anniversary
Week 11	11/22	Linear Model Selection
Week 12	11/29	Regularization
Week 13	12/06	Nonlinear Methods 1
Week 14	12/13	Nonlinear Methods 2
Week 15	12/20	Tree Based Methods 1
Week 16	12/27	Tree Based Methods 2
Week 17	1/03	Unsupervised Learning
Week 18	1/10	Final Exam
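
The course overview names k-nearest neighbors and cross-validation among its topics. As a rough illustration only, and not taken from the course materials, k-fold cross-validation of a k-NN classifier could be sketched as below. The toy two-cluster data, every function name, and the use of plain Python rather than R are assumptions made for this sketch.

```python
import random
from collections import Counter


def knn_predict(train_x, train_y, point, k):
    """Predict a label by majority vote among the k nearest training points."""
    # Squared Euclidean distance is sufficient for ranking neighbors.
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], point)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def k_fold_indices(n, n_folds, seed=0):
    """Shuffle the indices 0..n-1 and deal them into n_folds folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]


def cross_val_error(xs, ys, k, n_folds=5, seed=0):
    """Average held-out misclassification rate of k-NN over the folds."""
    errors = []
    for held_out in k_fold_indices(len(xs), n_folds, seed):
        held = set(held_out)
        # Train on every observation that is not in the held-out fold.
        train_x = [xs[i] for i in range(len(xs)) if i not in held]
        train_y = [ys[i] for i in range(len(xs)) if i not in held]
        wrong = sum(
            knn_predict(train_x, train_y, xs[i], k) != ys[i] for i in held_out
        )
        errors.append(wrong / len(held_out))
    return sum(errors) / len(errors)


def make_toy_data(n_per_class=30, seed=1):
    """Hypothetical data: two Gaussian clusters in the plane, labeled 0 and 1."""
    rng = random.Random(seed)
    xs, ys = [], []
    for label, (cx, cy) in ((0, (0.0, 0.0)), (1, (3.0, 3.0))):
        for _ in range(n_per_class):
            xs.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
            ys.append(label)
    return xs, ys


if __name__ == "__main__":
    xs, ys = make_toy_data()
    # Compare several neighborhood sizes by their cross-validated error.
    for k in (1, 5, 15):
        print(k, round(cross_val_error(xs, ys, k), 3))
```

Comparing the cross-validated error across values of k is one common way to choose k without touching a separate test set. In practice a library routine would normally replace the hand-written loops.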